3DS : Cherry Blossom

Lyuda Bekwinknoll, Meghana Cyanam, Theresa Marie Duenas, Kevin Kiser

With our data visualization we are determining the association between age and fitness based on running data from the Cherry Blossom Ten-mile Run held in Washington DC from 1973 to 2020.

Loading and Cleaning Data

# load packages
library(dplyr)
library(ggplot2)
library(chron)
library(plotly)
library(purrr)
library(RColorBrewer)
library(readr)
cb_10M_1973_2005_weather <- read_csv("cb_10M_1973_2005_weather.csv", show_col_types = FALSE)

cb_10M_2006_2019_weather <- read_csv("cb_10M_2006_2019_weather.csv", show_col_types = FALSE)

clean_data <- function(df) {
  df <- dplyr::select(df, Year, Name, Age, 
               Time, Division, pos_by_sex, 
               total_by_sex, Sex, PRCP, TMAX, TMIN)

  df$Age[df$Age == 'NR'] <- 0
  df$Time[df$Time == 'NR'] <- 0

df <- df %>%
    mutate(
      Year = as.integer(Year),
      Name = as.character(Name),
      Age = as.integer(Age, na.rm = TRUE),
      Time = as.character(Time),
      Division = as.character(Division),
      pos_by_sex = as.integer(pos_by_sex),
      total_by_sex = as.integer(total_by_sex),
      Sex = as.character(Sex),
      PRCP = as.numeric(PRCP),
      TMAX = as.integer(TMAX),
      TMIN = as.integer(TMIN)
    )
  
  return(df)
}

df1 <- clean_data(cb_10M_1973_2005_weather)
df2 <- clean_data(cb_10M_2006_2019_weather)


df <- bind_rows(df1, df2)
# fixes times with mins recorded as hrs (mm:ss:00 -> 00:mm:ss)
df <- df %>%
  mutate(
    Time = ifelse(
      Time >= "43:00:00" & Time <= "59:59:59",
      paste0("00:", substr(Time, 1, 2), ":", substr(Time, 4, 5)),
      Time
    )
  )


df <- df %>%
  filter(!is.na(Time) & Time != "" & Time != "00:00:00") %>%
  mutate(Time = chron::times(Time)) %>%
  filter(!is.na(Time)) %>%
  filter(Time >= "00:43:00" & Time <= "02:20:00") %>%
  filter(Age >= 8 & Age <= 87) %>%
  filter(Year != 1977) %>%
  filter(Year != 1973) %>%
  mutate(
    # replaces 'W' with 'F' in sex col
    Sex = ifelse(Sex == 'W', 'F', Sex),
    # correction for missing values by first character in division
    Sex = ifelse(substr(Division, 1, 1) == 'W', 'F', Sex),
    Sex = ifelse(substr(Division, 1, 1) == 'M', 'M', Sex)
  )
# add final data to git repository
# write.csv(df, "cleaned_10M_1973_2019.csv", row.names = FALSE)

unique(df$Division)
##  [1] "M3034" "M0119" "M2024" "M2529" "M3539" "M4549" "M4044" "M5559" "M5054"
## [10] "M6064" "M6569" "M7074" "W2529" "W0119" "W3034" "W2024" "W5054" "W3539"
## [19] "W4549" "W4044" "M7579" "W5559" "W6064" "M8099" "W6569" "W7074" "W7579"
## [28] "W8099"

Describing our Data

Variable Names Data Type Variable Descriptions
Year Integer Year the race was held.
Name Character

An individual’s first and last name with varying formats. Most of the CUCB website results for names also list an 'M', 'F', or 'W' in parenthesis for the individual's sex.

example: James Yenckel (M)

Age Integer Age of runner at time of race.
Time Time/Numeric Time in hr:min:sec format to run 10 miles. This is how long it took each runner to complete the race.
Division Character

28 different divisions are contained, 14 in each sex. They range from 4 of them having 20 year ranges, while the rest have 5 year ranges. Each division is an alphanumeric code separating competitors by sex and age. The example shows 25 to 29-year-old women.

example: W2529

pos_by_sex Integer Shows the place that a runner finished by sex per year.
total_by_sex Integer The total number of competitors overall for a sex per year.
Sex Character Gender of runner.
PRCP Numeric Precipitation recorded as daily rainfall in inches to one decimal place collected by NOAA.
TMAX Integer Minimum daily temperature recorded in Fahrenheit, collected by NOAA.
TMIN Integer Maximum daily temperature recorded in Fahrenheit, collected by NOAA

Dataset Overview:

In the original data set we have 347402 rows and 17 columns. After cleaning the data set we ended up with 339934 rows and 11 columns. 7468 rows of data were omitted from the data we used because they had missing values for the time and/or age variables. Below is the description of the variables and data we excluded for our data analysis/visualization:

What was excluded Reason for exclusion
Hometown Many missing values and inconsistencies were found in the data entries. We found a few individuals reporting their hometown differently each time they ran the race or just reporting several at once. Due to this we decided to remove this variable from our analysis because there are no accurate conclusions that can be drawn. Also this is not a variable we could use to fulfill our main objective, so we chose to exclude it from our analysis.
Distance The data in this column was describing the race of this length which is 10 miles. Since we already know it is the data from the 10 mile race, having a column that explicitly states that for row of our data is redundant.
Date We decided to exclude this variable since we know that the race happens at a certain time each year during spring, and having the specific dates would not impact our data question in any way.
pos_by_div This variable gives us the position that a runner finished in their assigned division for a certain year. There are 28 different divisions, 14 per sex. Each of these divisions includes a age range which differs from 20 years to 5 years. We decided to exclude this variable since the same information could be obtained from the pos_by_sex variable and we could instill our own age ranges.
total_by_division This variable gives the total number of individuals in each division for a certain year. The divisions are the same as described above and are excluded for the same reason as above.
Pace The Pace gave the pace per mile of each runner for the race. We decided to exclude this from our analysis because of the fact it wasn’t reading in correctly. Also the pace can be calculated directly from the Time variable by dividing it by 10 (the total miles in the race). Therefore we decided to remove this column of data from our data frame.
Data from the year of 1977 We decided to remove the data from the year of 1977 due to the fact that there was a large chunk of data missing from the times right in the middle of the race time. We are not given information about what happened in that period that resulted in such record, so we don’t have any background about that. Also if we keep this year in our data analysis, is has a potential to make our data analysis biased since there are a lot of points missing from a main part of the times, which would lead to inaccuracy in our interpretations. That is why we have decided to exclude the year from our data.

Summary Statistics:

Year, Age, Time, Sex main variables to focus on.

Checklist for this section:

summary stats: mean, median, mode, range, sd, percentiles, distributions by sex variable, etc.

mention how many women and how many men in each year and overall

These were helping me evaluate the data cleaning, we can fix or replace them later

x <- df$Time
fit <- density(x)

custom_ticks <- c("00:42", "01:13", "01:45", "02:17", "02:49")
tick_positions <- seq(min(x), max(x), length.out = length(custom_ticks))
tick_labels <- as.numeric(tick_positions)

plot_ly(x = x, type = "histogram", name = "Histogram") %>% 
  add_trace(x = fit$x, y = fit$y, type = "scatter", mode = "lines", fill = "tozeroy", yaxis = "y2", name = "Density Curve") %>% 
  layout(title = "Distribution and Density Curve \n All Years",
         xaxis = list(title = "Time (hr:min)", showline = TRUE, 
                      tickvals = tick_positions, ticktext = custom_ticks),         
         yaxis2 = list(title = "Density", showline = TRUE, overlaying = "y"),
         yaxis = list(showticklabels = FALSE, side = "left"))